Abstract:

This  paper  explores  the  tradeoffs  involved  in  mapping  a  signals  intelligence  algorithm  to 
general-purpose  processor  and  field  programmable  gate  array  (FPGA)  based  technology. 
Specifically, a prototypical signal detection algorithm is described. This algorithm consists of a
Fourier transform based frequency channelizer followed by a statistical signal detector.

The system examined consists of a single 14-bit real-valued input stream sampled at 100 MSa/s.
With a Fourier transform overlap factor of 75%, this results in a total sustained bandwidth of 1
GB/s. This bandwidth is more than a single commercial off-the-shelf (COTS) processor can move
on and off the board, which leads to system solutions that use multiple processors.

The multiple-processor partitioning problem is examined in both the time and frequency
domains. For the example application, the strengths and weaknesses of each strategy are
examined. The choice of processing platform also influences the partitioning, and therefore the
final solution. General-purpose processors typically run at very high clock speeds, but perform only
a small number of operations per clock cycle. FPGAs, on the other hand, can perform thousands
of operations per clock cycle, but operate at a lower clock frequency. These differences, as
well as other system features such as the interprocessor communication subsystem, dramatically
affect the viability of potential partitioning solutions.

It is shown that successful multiprocessor partitioning depends on the entire system. Of critical
importance are the features and performance of the processing nodes and the interprocessor
communications system. When the requirements are greater than a single aspect of the system
can handle, this paper explores the possibility of utilizing excess capacity in other areas of the
system to balance the system loading. Finally, some of the issues that arise from extending the
system to multiple antenna streams are also explored.

Figure 1: An example heterogeneous multicomputing platform.
Figure 2: Some of the issues associated with heterogeneous multicomputing applications.
The goal of the acquisition system is to detect the presence of new signals in the environment
in a timely manner so that they may be identified and exploited

♦  A  homogeneous  system  based  on  general  purpose  processors  can  be  used  to  solve 
the  problem,  but  advances  in  FPGAs  allow  for  a  higher  processing  density,  especially 
in  the  channelizer 

♦ Current FPGAs can offer more than a 10x computational density improvement over general
purpose processors for certain applications. Unfortunately, most communication
fabrics have not scaled at the same rate


♦ Typical acquisition systems place the delay memory after the analog to digital
converter. For demodulation, a digital down converter (DDC) is used to
heterodyne and filter the delayed data stream

♦  When  the  number  of  signals  to  demodulate  becomes  very  large  (>100),  the  typical 
DDC-based  approach  becomes  cumbersome.  FFT  processing  can  allow  the 
simultaneous  down  conversion  and  filtering  of  thousands  of  signals.  The 
downside  is  an  increase  in  the  amount  of  memory  needed  to  give  the  same  time 
delay 
Parameters:

16384 point real DFT
4:1 overlap and windowing (P = 4)
16384 input sample maximum latency requirement
Output 8192 bins, complex, 24 bits per component (1200 MB/s)

[Figure labels: 800 MB/s, 16384 point real FFT]

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, where W_N = e^{-j 2\pi / N}


•  The  channelizer  throughput  is  greater 
than  most  interconnect  fabrics  can 
support 

Need  to  partition  the  problem 
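The rates quoted on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes each 14-bit sample is stored in a 16-bit word (2 bytes), which is consistent with the 200 MB/s raw input rate quoted later in the presentation; the function and parameter names are illustrative only.

```python
def channelizer_rates(sample_rate, bytes_per_sample, n, p, bytes_per_component):
    """Return (raw input, overlapped input, output) rates in bytes per second."""
    raw_input = sample_rate * bytes_per_sample
    # With P:1 overlap every sample is read by P different transforms.
    overlapped_input = raw_input * p
    # A new transform starts every n / p input samples.
    transforms_per_second = sample_rate / (n / p)
    # A real n-point transform yields n / 2 complex bins (2 components each).
    output = transforms_per_second * (n // 2) * 2 * bytes_per_component
    return raw_input, overlapped_input, output


# Parameters from the slide; 2 bytes per sample is an assumption.
raw, overlapped, out = channelizer_rates(
    sample_rate=100e6, bytes_per_sample=2, n=16384, p=4, bytes_per_component=3
)
print(f"raw input:        {raw / 1e6:.0f} MB/s")
print(f"overlapped input: {overlapped / 1e6:.0f} MB/s")
print(f"output:           {out / 1e6:.0f} MB/s")
```

Under these assumptions the overlapped input and the output come to 800 MB/s and 1200 MB/s, matching the figures on the slide.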


The  channelizer 
decomposes  the  input 
sample  stream  into 
frequency  channels  by 
performing  overlapped 
and  windowed,  short 
time  Fourier  transforms 
on  the  input  data  stream 
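A minimal sketch of the channelizer described above, scaled down so that a direct DFT stays cheap. The Hann window, the tiny transform size, and all function names are assumptions for illustration; the presentation does not specify the window, and a real implementation would use an FFT.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window (an assumed choice; the source names no window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def half_spectrum(frame):
    """Direct DFT of a real frame, keeping only bins 0 .. n/2 - 1."""
    n = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n // 2)
    ]


def channelize(samples, n, p):
    """Overlapped, windowed short-time transforms with a hop of n / p samples."""
    hop = n // p
    window = hann(n)
    frames = []
    for start in range(0, len(samples) - n + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(n)]
        frames.append(half_spectrum(frame))
    return frames


# Toy example: n = 16, P = 4, a real tone centred on bin 3.
n, p = 16, 4
tone = [math.cos(2 * math.pi * 3 * t / n) for t in range(64)]
frames = channelize(tone, n, p)
peak_bins = [max(range(n // 2), key=lambda k: abs(f[k])) for f in frames]
```

Each entry of `frames` is one frequency sweep (time slice); with P = 4, consecutive sweeps share 75% of their input samples.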
The new energy alarm detects
the  presence  of  “new”  signals 
and  performs  some  rudimentary 
external  measurements 

High-input  bandwidth  and 
computational  requirements 
necessitate  partitioning 

Partitioning  strategies 

♦  Commutation 

♦  Time  domain  broadcast 

♦  Frequency  division 


[Figure: detector processing chain; labels: Peak Association, New Energy Alarm, Alarm Report]

Example alarm report:
Fc = 45.325 MHz
BW = 30.21 kHz
Tup = 15.000
Tdown = 25.000

Each  processing  partition  generates  the  entire 
frequency  sweep  for  a  range  of  time  slices 


[Figure labels: Shared, IPC, New Data]


In  this  case,  the  channelizer  is  split 
into  four  partitions 

Each  partition  processes  a  contiguous 
segment  of  the  input  data  stream 

By  partitioning  the  problem  in  this  manner, 
extra  overhead  to  handle  the  segment 
boundaries  is  incurred 

Also, the system latency increases due to the time-expansion nature of the
commutation process



In  this  example,  P-1  old  input  data  blocks 
need  to  be  received  and  P-1  blocks  need 
to  be  sent 

Total  I/O  overhead  is  2(P-1)  blocks  for 
transfers  greater  than  2P-1  blocks  of  data 
To meet the one block latency requirement, only 1/4 of a
block  of  new  data  is  passed  to  each  partition;  3/4  of  the 
data  comes  from  the  other  partitions.  This  results  in  the 
same  data  block  being  transferred  an  extra  3  times! 


Throughput  Overhead  for  Time  Commutation 


For  this  example, 
the  input  bandwidth 
increases  from  200 
MB/s  to  800  MB/s. 
We’re  going  the 
wrong  way! 
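The growth in input bandwidth can be illustrated with a simplified model. It assumes that a partition producing transforms for `new_segments` hop-sized segments of new data (one hop is 1/P of a transform block) must also receive the P-1 preceding segments; it counts received data only and ignores the matching sends. The function name and the model itself are illustrative assumptions, not taken from the slides.

```python
def commutation_input_factor(new_segments, overlap_p):
    """Data received by a partition divided by the new data it processes."""
    if new_segments < 1 or overlap_p < 1:
        raise ValueError("new_segments and overlap_p must be positive")
    return (new_segments + overlap_p - 1) / new_segments


RAW_INPUT_MB_S = 200.0  # raw input rate quoted on the slide
P = 4                   # 4:1 overlap

for new_segments in (1, 4, 16, 64):
    factor = commutation_input_factor(new_segments, P)
    print(f"{new_segments:3d} new segments per transfer: "
          f"{RAW_INPUT_MB_S * factor:.0f} MB/s aggregate input")
```

With one new segment per transfer, which is what the one block latency requirement forces, the factor equals P and the 200 MB/s stream becomes 800 MB/s. Larger transfers amortize the overhead, but at the cost of latency.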
Each partition receives the
entire  data  stream 

No  extra  I/O  overhead  is 
incurred 

Partitioning  does  not  add 
latency 


[Figure labels: New Data, Saved]
No communication between channelizer partitions

Still  need  to  send  all  of  the  channelizer  output  data  to 
the  detector  processors 

If  there  are  a  sufficient  number  of  detector  processing 
elements  allocated  and  they  are  located  correctly  in  the 
fabric,  then  I/O  bottlenecks  can  be  avoided 
Each fabric connection is a 266 MB/s half duplex link to the crossbar.
Typical  performance  is  around  250  MB/s 

To  accommodate  the  channelizer  output  rate,  the  output  stream  must  be 
split  across  multiple  connections 

This  complicates  the  data  flow  and  fully  utilizes  the  fabric  I/O  capacity  in 
many  places 
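As a rough sizing check, the number of fabric connections a stream needs follows from the typical per-link throughput quoted above. The helper below is a hypothetical back-of-the-envelope calculation; it ignores protocol overhead, half-duplex turnaround, and contention in the crossbar.

```python
import math


def links_needed(stream_mb_s, link_mb_s=250.0):
    """Smallest number of links whose combined typical rate covers the stream."""
    return math.ceil(stream_mb_s / link_mb_s)


CHANNELIZER_OUTPUT_MB_S = 1200.0  # output rate from the Parameters slide
output_links = links_needed(CHANNELIZER_OUTPUT_MB_S)
print(f"channelizer output needs at least {output_links} links")
```

At 250 MB/s per link, the 1200 MB/s channelizer output needs at least five connections, which is why the output stream has to be split.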
Each node generates a subset
of  the  frequency  bins  for  all 
time  slices 

Each  partition  needs  all  of  the 
time  series  data  for  each 
sweep 
Able to pipeline the
channelizer and detection
processing with minimal
interprocessor communication
Each partition performs the front part of the FFT
computations.  This  is  inefficient,  but  the  I/O  issues  are 
simpler 

Uniform  communication  between  partitions  at  partition 
boundaries  only 

♦  Single  element  communication  between  channelizer  partitions 

♦  Similar  communication  between  detector  partitions! 

So  why  the  unusual  allocation  of  the  frequency  bins? 
• The algorithm exploits the efficient computation of the DFT of a
2N-point,  real-valued,  input  sequence  with  an  N-point  complex 
transform 

•  Since  the  input  data  is  real,  only  one  half  of  the  output  spectrum 
needs  to  be  computed 

G(k) = \frac{1}{2} \left[ X(k) + X^*(N-k) - j W_{2N}^{k} \left( X(k) - X^*(N-k) \right) \right] for k = 0, 1, ..., N-1.
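The identity can be checked numerically. The sketch below assumes the usual packing, in which the 2N real samples g(n) form the N-point complex sequence x(n) = g(2n) + j g(2n+1), with X(k) its N-point DFT, X(N) taken as X(0), and W_{2N} = e^{-j 2\pi / (2N)}. A direct DFT stands in for the FFT to keep the example self-contained; the function names are illustrative.

```python
import cmath
import math


def dft(x):
    """Direct O(n^2) DFT, used here only as a reference."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def real_dft_via_half_length(g):
    """First N bins of the DFT of 2N real samples, from one N-point complex DFT."""
    two_n = len(g)
    n = two_n // 2
    # Pack even samples into the real part and odd samples into the imaginary part.
    x = [complex(g[2 * m], g[2 * m + 1]) for m in range(n)]
    big_x = dft(x)
    result = []
    for k in range(n):
        xk = big_x[k]
        xc = big_x[(n - k) % n].conjugate()  # X*(N - k), with X(N) = X(0)
        w = cmath.exp(-2j * cmath.pi * k / two_n)  # W_{2N}^k
        result.append(0.5 * ((xk + xc) - 1j * w * (xk - xc)))
    return result


g = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
reference = dft(g)[: len(g) // 2]
fast = real_dft_via_half_length(g)
max_error = max(abs(a - b) for a, b in zip(reference, fast))
```

`max_error` should sit at floating-point rounding level, confirming that the half-length complex transform plus the recombination step reproduces the lower half of the real-input spectrum.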


By  processing  a  range  of  X(k), 
X(N-k)  frequency  pairs,  twice  as 
many  output  points  can  be 
calculated  with  only  two  extra 
additions! 
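One hypothetical way to hand out mirrored frequency pairs to partitions is sketched below: each partition takes a contiguous range of k in the lower half together with the matching N-k bins. The function name and the handling of the self-paired bins 0 and N/2 are assumptions for illustration and are not taken from the slides.

```python
def mirrored_bin_allocation(n, parts):
    """Assign bins 0 .. n-1 to partitions as (k, n - k) pairs."""
    half = n // 2
    allocation = []
    for p in range(parts):
        lo = p * half // parts
        hi = (p + 1) * half // parts
        bins = set()
        for k in range(lo, hi):
            bins.add(k)
            if k != 0:  # bin 0 has no distinct mirror within 0 .. n-1
                bins.add(n - k)
        if p == parts - 1:
            bins.add(half)  # bin n/2 pairs with itself; give it to the last partition
        allocation.append(sorted(bins))
    return allocation


allocation = mirrored_bin_allocation(16, 4)
for index, bins in enumerate(allocation):
    print(f"partition {index}: bins {bins}")
```

Each partition then holds both X(k) and X(N-k) for its range, so the recombination step needs no data from other partitions.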
• Redundant computations are performed in each
channelizer  partition  to  reduce  the  system  I/O 
requirements 

•  Unlike  the  time  partitioning  case,  most  of  the  I/O 
movement  occurs  locally  in  the  fabric 


©  2003  Mercury  Computer  Systems,  Inc. 


[Figure: multi-antenna block diagram; labels: A/D, Frequency Channelizer (one per antenna), Eigen-analysis, Detector]


Adds  an  extra  data 
dimension  to  the 
problem 

Typical  antenna  arrays 
consist  of  4  -  8  antenna 
elements 


The detection algorithms differ from those of the single antenna
case. Typically, eigenspace methods are employed

♦  The  channelizer  is  similar  between  the  single  and  multi-antenna  cases 

♦  Detection  processing  requires  the  time  series  data  for  each  frequency  bin 

♦  Log  magnitude  computation  is  not  required 

♦  Eigenvalues  and  eigenvectors  are  computed  for  each  bin,  across  all  antennas 
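The per-bin eigen-analysis can be sketched as follows: for one frequency bin, the channelizer outputs of all antennas over a run of time slices form a sample covariance matrix, whose eigenvalues and eigenvectors are then examined. This example uses power iteration to find only the dominant eigenpair; that is a stand-in chosen to keep the example dependency-free, not the method from the presentation, and a full decomposition would normally come from a linear algebra library. All names and the synthetic data are illustrative.

```python
import cmath
import math


def bin_covariance(snapshots):
    """Sample covariance for one bin; each snapshot holds one value per antenna."""
    m = len(snapshots[0])
    r = [[0j] * m for _ in range(m)]
    for x in snapshots:
        for i in range(m):
            for j in range(m):
                r[i][j] += x[i] * x[j].conjugate()
    return [[value / len(snapshots) for value in row] for row in r]


def dominant_eigenpair(r, iterations=100):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    m = len(r)
    v = [complex(1.0 / math.sqrt(m))] * m
    for _ in range(iterations):
        w = [sum(r[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    rv = [sum(r[i][j] * v[j] for j in range(m)) for i in range(m)]
    eigenvalue = sum(v[i].conjugate() * rv[i] for i in range(m)).real
    return eigenvalue, v


# Synthetic bin data: one noise-free signal of amplitude 2 seen by 4 antennas.
steering = [cmath.exp(0.5j * i) for i in range(4)]
snapshots = [[2.0 * cmath.exp(1j * t) * a for a in steering] for t in range(32)]
eigenvalue, eigenvector = dominant_eigenpair(bin_covariance(snapshots))
```

For this rank-one test case the dominant eigenvalue equals the signal power times the number of antennas, and the eigenvector is proportional to the steering vector. Note that the covariance needs every antenna's data for the bin, which is why keeping a bin's data local to one partition is attractive.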

As  shown  in  the  previous  example,  frequency  domain 
partitioning  has  advantages  when  I/O  bound  conditions  occur 

♦  The  majority  of  the  data  movement  is  local 

♦  High-speed  local  interconnects  may  be  used  instead  of  the  fabric  connections 
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